# **Course Learning Outcomes**

- Students shall demonstrate an understanding of the internal organization of a computer system through assembly language.
- Students shall design and simulate the data path and the control unit of a simple computer based on an instruction set.
- Students shall demonstrate an understanding of pipelining, including instruction sequencing, register value forwarding, and data interlocking.
- Students shall demonstrate an understanding of the basic concepts of multiprocessor and multi-core designs.
- Students shall demonstrate an understanding of the history and possible future of the field necessary for staying at the forefront of computing systems development (life-long learning).






Chips with multiple processors (cores)



#### Hardware and Software

- Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., quad-core Xeon e5345
- Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
- Sequential/concurrent software can run on serial/parallel hardware
  - Challenge: making effective use of parallel hardware

Chapter 7: Multicores, Multiprocessors, and Clusters

#### **Parallel Programming**

- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!

- Difficulties
  - Partitioning
  - Coordination
  - Communications overhead



### Amdahl's Law



- Example: 100 processors, 90× speedup?
  - $T_{new} = T_{parallelizable} / 100 + T_{sequential}$
  - Speedup = $\frac{1}{(1 - F_{\text{parallelizable}}) + F_{\text{parallelizable}}/100} = 90$
  - Solving: F<sub>parallelizable</sub> = 0.999
- Need sequential part to be 0.1% of original time
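
The arithmetic in this example can be checked with a short Python sketch. The function names `amdahl_speedup` and `required_parallel_fraction` are illustrative choices, not from the slides; the second one simply inverts the speedup formula to solve for the parallelizable fraction.

```python
def amdahl_speedup(f_parallel, n_processors):
    """Speedup when a fraction f_parallel of the original time scales across n_processors."""
    return 1.0 / ((1.0 - f_parallel) + f_parallel / n_processors)


def required_parallel_fraction(target_speedup, n_processors):
    """Invert Amdahl's Law: the fraction that must be parallelizable to reach the target.

    From (1 - F) + F/n = 1/S it follows that F = (1 - 1/S) / (1 - 1/n).
    """
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_processors)


if __name__ == "__main__":
    f = required_parallel_fraction(90, 100)
    print(f"parallelizable fraction needed: {f:.4f}")
    print(f"sequential share of original time: {(1 - f) * 100:.2f}%")
    print(f"check, speedup with that fraction: {amdahl_speedup(f, 100):.1f}")
```

Rounded to three digits the fraction is 0.999, so the sequential part can be only about 0.1% of the original time, as stated above.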


## **Scaling Example**

- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
- Single processor: Time =  $(10 + 100) \times t_{add}$
- 10 processors
  - Time =  $10 \times t_{add} + 100/10 \times t_{add} = 20 \times t_{add}$
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time =  $10 \times t_{add} + 100/100 \times t_{add} = 11 \times t_{add}$
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes load can be balanced across processors
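
The numbers above can be reproduced with a small sketch. It assumes, as the example does, that the 10 scalar additions stay sequential and that the 100 matrix additions divide evenly across processors; `scaling_example` is a hypothetical helper name, and times are in units of $t_{add}$.

```python
def scaling_example(n_processors, n_scalars=10, matrix_elems=100):
    """Return (time, speedup, fraction of potential) for the scalar + matrix sum workload."""
    sequential = n_scalars                      # scalar sum is not parallelized
    parallel = matrix_elems / n_processors      # matrix sum split evenly (balanced load)
    time = sequential + parallel
    single_processor_time = n_scalars + matrix_elems
    speedup = single_processor_time / time
    return time, speedup, speedup / n_processors


if __name__ == "__main__":
    for p in (10, 100):
        time, speedup, fraction = scaling_example(p)
        print(f"{p:3d} processors: time = {time:g} t_add, "
              f"speedup = {speedup:g}, {fraction:.0%} of potential")
```

Passing a larger `matrix_elems` shows how the fraction of potential changes when the parallel part of the workload grows while the sequential part stays fixed.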







### **Example: Sum Reduction**



#### **Message Passing**

- Each processor has private physical address space
- Hardware sends/receives messages between processors
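
As a rough illustration of a sum reduction in the message-passing style, the sketch below simulates nodes with Python threads. This is an assumption-laden toy, not the code from the slides: each simulated node touches only its own chunk (standing in for a private address space), and nodes exchange partial sums only through per-node `queue.Queue` mailboxes (standing in for hardware send/receive). The name `reduce_sum_message_passing` is hypothetical.

```python
import queue
import threading


def reduce_sum_message_passing(chunks):
    """Sum all chunks with a tree of send/receive steps between simulated nodes."""
    n = len(chunks)
    if n == 0:
        return 0
    inbox = [queue.Queue() for _ in range(n)]   # one mailbox per simulated node
    result = queue.Queue()

    def node(pid, private_data):
        total = sum(private_data)               # local work on private data only
        limit = n                               # nodes still holding a partial sum
        while limit > 1:
            half = (limit + 1) // 2             # nodes 0..half-1 stay active this round
            if pid >= half:
                inbox[pid - half].put(total)    # send partial sum to partner, then stop
                return
            if pid + half < limit:
                total += inbox[pid].get()       # receive a partner's partial sum
            limit = half
        result.put(total)                       # only node 0 reaches this point

    threads = [threading.Thread(target=node, args=(p, chunks[p])) for p in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result.get()


if __name__ == "__main__":
    print(reduce_sum_message_passing([[1, 2], [3, 4], [5], [6, 7, 8]]))
```

Each round halves the number of active nodes, so the reduction takes about $\log_2 n$ communication steps instead of $n - 1$ sequential additions of partial sums. Because addition is commutative, the result does not depend on the order in which messages arrive in a mailbox.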






# **Grid Computing**

- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home, World Community Grid





- Instructions from independent threads execute when function units are available
- Within threads, dependencies handled by scheduling and register renaming





# **Instruction and Data Streams**

#### An alternate classification

| Instruction Streams | Data Streams: Single       | Data Streams: Multiple        |
|---------------------|----------------------------|-------------------------------|
| Single              | SISD: Intel Pentium 4      | SIMD: SSE instructions of x86 |
| Multiple            | **MISD**: No examples today | MIMD: Intel Xeon e5345        |

SPMD: Single Program Multiple Data

- A parallel program on a MIMD computer
- Conditional code for different processors
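
A minimal sketch of the SPMD idea: every worker runs the same function and branches on its own rank. The names `program` and `run_spmd` and the coordinator/worker roles are assumptions made for illustration. Python threads stand in for the processors of a MIMD machine here, so this shows the program structure only, not real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def program(rank, size, data):
    """The single program every worker runs; behavior branches on rank."""
    chunk = data[rank::size]                 # each rank takes its own slice of the data
    partial = sum(x * x for x in chunk)
    if rank == 0:
        role = "coordinator"                 # conditional code for one processor
    else:
        role = "worker"                      # conditional code for the others
    return rank, role, partial


def run_spmd(size, data):
    """Launch `size` copies of the same program, one per rank."""
    with ThreadPoolExecutor(max_workers=size) as pool:
        return list(pool.map(lambda r: program(r, size, data), range(size)))


if __name__ == "__main__":
    results = run_spmd(4, list(range(8)))
    for rank, role, partial in results:
        print(rank, role, partial)
    print("total:", sum(p for _, _, p in results))
```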



# **History of GPUs**

- Early video cards
  - Frame buffer memory with address generation for video output
- 3D graphics processing
  - Originally high-end computers (e.g., SGI)
  - Moore's Law  $\Rightarrow$  lower cost, higher density
  - 3D graphics cards for PCs and game consoles
- Graphics Processing Units
  - Processors oriented to 3D graphics tasks
  - Vertex/pixel processing, shading, texture mapping, rasterization


#### **GPU Architectures**

- Processing is highly data-parallel
  - GPUs are highly multithreaded
  - Use thread switching to hide memory latency
    - Less reliance on multi-level caches
  - Graphics memory is wide and high-bandwidth
- Trend toward general purpose GPUs
  - Heterogeneous CPU/GPU systems
  - CPU for sequential code, GPU for parallel code
- Programming languages/APIs
  - DirectX, OpenGL
  - C for Graphics (Cg), High Level Shader Language (HLSL)
  - Compute Unified Device Architecture (CUDA)



# Example: NVIDIA Tesla



#### Warp: group of 32 threads

- Executed in parallel, SIMD style
  - 8 SPs × 4 clock cycles
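
The relationship between warp size, SP count, and clock cycles can be sketched as follows. The figures of 32 threads and 8 SPs come from the text above; the assignment of contiguous thread IDs to each cycle is an assumption made for illustration, and `warp_schedule` is a hypothetical name.

```python
WARP_SIZE = 32   # threads per warp
NUM_SPS = 8      # streaming processors executing in lockstep


def warp_schedule(warp_size=WARP_SIZE, num_sps=NUM_SPS):
    """Return, per clock cycle, the thread IDs issued to the SPs.

    Assumes threads are issued in contiguous groups, one group per cycle.
    """
    return [list(range(start, min(start + num_sps, warp_size)))
            for start in range(0, warp_size, num_sps)]


if __name__ == "__main__":
    for cycle, threads in enumerate(warp_schedule()):
        print(f"cycle {cycle}: threads {threads[0]}..{threads[-1]}")
```

With 32 threads and 8 SPs, one instruction for the whole warp occupies the SPs for 32 / 8 = 4 cycles.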







# **Network Characteristics**

- Performance
  - Latency per message (unloaded network)
  - Throughput
  - Congestion delays (depending on traffic)
- Cost
- Power
- Routability in silicon




# **Concluding Remarks**

- Goal: higher performance by using multiple processors
- Difficulties
  - Developing parallel software
  - Devising appropriate architectures
- Many reasons for optimism
  - Changing software and application environment
  - Chip-level multiprocessors with lower latency, higher bandwidth interconnect
- An ongoing challenge for computer architects!

